Genome Biology

Springer Science and Business Media LLC

Preprints posted in the last 30 days, ranked by how well they match Genome Biology's content profile, based on 555 papers previously published here. The average preprint has a 0.30% match score for this journal, so anything above that is already an above-average fit.

1
TopOmics: Topic Modelling for All Omics

Sanguinetti, G.; El Kazwini, N.; Caretti, F.

2026-05-29 bioinformatics 10.64898/2026.05.26.727810 medRxiv
Top 0.1%
18.4%

Topic models have emerged as a popular paradigm to analyse and interpret complex single-cell and spatial data. Yet, current implementations are usually data-type specific and rely on different modelling and estimation approaches, hindering usability and interoperability. In this work we introduce TopOmics, a library to perform efficient and flexible topic modelling with any combination of -omics data at scale. The framework leverages standard libraries of the Python ecosystem, guaranteeing seamless integration with existing pipelines, and shows competitive performance against state-of-the-art methods while preserving interpretability. We provide several examples of TopOmics on diverse data sets, including a novel topic model for spatial multi-omic data, and an analysis of a very large VisiumHD data set.

2
Monju: Multi-criteria clustering in single-cell omics

Kaneko, T.; Sakaguchi, S.; Fujioka, S.; Yada, Y.; Kojima, R.; Naoki, H.

2026-06-01 bioinformatics 10.64898/2026.05.28.728427 medRxiv
Top 0.1%
17.9%

Clustering is a fundamental step in single-cell omics analysis. Although single-cell omics data can, in principle, be partitioned according to multiple biologically meaningful criteria, existing methods typically cluster cells using a single criterion. To address this problem, we developed Monju, a multi-criteria clustering method based on a deep generative mixture model. Monju divides cells into biologically reasonable submodels, each of which is equipped with an interpretable latent space. Furthermore, although the partitioning of cells into submodels varies across random seeds, each solution remains biologically plausible, collectively yielding multi-criteria clustering. Moreover, by integrating these multiple clustering solutions to perform meta-clustering, Monju enables the assessment of cluster stability. We applied Monju to human peripheral blood CITE-seq data and demonstrated that it can achieve multi-criteria clustering. Monju therefore provides a powerful and practical framework for dissecting cellular heterogeneity from multiple biological perspectives.

3
Informational blueprints reveal condition-dependent gene regulatory architectures

Gokmen, D. E.; Pan, R. W.; Roeschinger, T.; Quake, S.; Garcia, H.; Phillips, R.; Vitelli, V.

2026-05-20 genomics 10.64898/2026.05.18.726006 medRxiv
Top 0.1%
14.5%

While coding regions in the genome have a direct interpretation in terms of protein products, significant fractions are non-coding and yet control essential biological functions. Unlike the genetic code, there is no "lookup table" that identifies where regulatory proteins, known as transcription factors (TFs), bind. Here, we extract these binding sites by distilling sequences of nucleotide letters into collective coordinates (hyperletters) representing the binding sites that are active under specific environmental conditions. Going beyond local information footprints between individual bases and expression levels, our information blueprint algorithm compresses the global information by optimising filters that simultaneously scan an entire promoter sequence. Inspired by renormalisation-group techniques, we identify TF binding sites as coarse-grained variables combining groups of correlated mutations with the highest collective impact on gene expression. We validate our approach on experimental data for E. coli and discover novel regulatory elements illustrating its deployment at scale across growth conditions.

4
Tandem: a bioinformatics tool for detection, mechanism classification, and population quantification of bacterial tandem gene duplications

Ngan, W. Y.; Smith, E. S. J.

2026-05-26 bioinformatics 10.64898/2026.05.22.727201 medRxiv
Top 0.1%
14.3%

Motivation: Tandem gene duplication drives antibiotic resistance, metabolic adaptation, and gene-family expansion in bacteria, but no single tool detects duplications in reference genomes, discovers their junctions in isolate sequencing, and quantifies the junctions in population samples. Existing callers (e.g. breseq) detect duplications without classifying formation mechanisms and often fail to quantify the duplication. Results: Tandem has 3 modules. Module 1 detects reference-genome duplications by NUCmer self-alignment and classifies each by homologous-recombination signature and junction microhomology length. Module 2 confirms junctions in whole-genome sequencing at user-nominated coordinates after the user inspects the coverage plot. Module 3 quantifies known junctions in population sequencing using the novel Junction Read Ratio (JRR). On 280 artificial population tests across seven bacterial species, Tandem achieves 100% recall and 4.3% mean absolute error. Applied to experimentally evolved Pseudomonas fluorescens SBW25 populations, Tandem resolves multiple co-segregating duplication fragments. Availability: Source code, documentation, and test data are available under the MIT License at https://github.com/yuingan/tandem. Implemented in Python 3. Requires NUCmer (MUMmer4), minimap2, and samtools.

5
Using Mapping-Profiles to Refine Strain-Level Metagenomic Classification

Lipovac, J.; Angevin, L.; Krizanovic, K.

2026-05-20 bioinformatics 10.64898/2026.05.18.725856 medRxiv
Top 0.1%
14.3%

Metagenomic classification at the strain level remains challenging due to high sequence similarity among closely related genomes, which leads to ambiguous read mappings and frequent false-positive strain detections. Reducing such errors improves the reliability of strain-level analyses, which is critical for applications such as pathogen detection. We introduce StrainRefine, a post-mapping refinement method that analyzes read-reference mapping profiles to resolve ambiguous assignments among highly similar genomes. The method represents candidate reference genomes using binary profiles that capture read-support patterns and measures similarity between references based on profile overlap. The method clusters references based on similar mapping profiles, filters weakly supported genomes, and reassigns reads to representative references, reducing redundant reporting of near-identical strains. StrainRefine substantially reduces false-positive strain detections while preserving recall and improving agreement between predicted and true abundance profiles. On large-scale metagenomic datasets, it achieves a substantially improved precision-recall balance compared to existing mapping-based approaches, with the standalone method obtaining the highest read-level classification accuracy on the most complex evaluated dataset. Unlike many strain-level tools designed for individual species, StrainRefine operates without prior assumptions about sample composition or curated species-specific reference collections, while still achieving comparable performance in single-species settings on species-specific reference databases. These results highlight mapping-profile similarity as an effective signal for improving strain-level metagenomic classification.
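As a rough illustration of the profile-overlap idea in the abstract above, the sketch below represents each reference genome by the set of reads that support it (equivalent to a binary profile) and groups references whose profiles overlap strongly. The abstract does not state which overlap measure or clustering rule StrainRefine uses; the Jaccard index, the single-linkage grouping, the 0.8 threshold, and all names here are assumptions chosen for illustration only.

```python
from itertools import combinations


def jaccard(a: set, b: set) -> float:
    """Jaccard overlap of two read-support sets (1.0 means identical support)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def cluster_references(profiles: dict, threshold: float = 0.8) -> list:
    """Group references whose profile overlap is >= threshold (single linkage).

    `profiles` maps a reference name to the set of read IDs mapped to it.
    Both the measure and the threshold are illustrative assumptions.
    """
    parent = {ref: ref for ref in profiles}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for r1, r2 in combinations(profiles, 2):
        if jaccard(profiles[r1], profiles[r2]) >= threshold:
            parent[find(r1)] = find(r2)

    groups = {}
    for ref in profiles:
        groups.setdefault(find(ref), set()).add(ref)
    return sorted(sorted(group) for group in groups.values())


# Toy data: two near-identical strains and one unrelated strain.
profiles = {
    "strain_A": {1, 2, 3, 4, 5},
    "strain_B": {1, 2, 3, 4, 5, 6},
    "strain_C": {7, 8, 9},
}
clusters = cluster_references(profiles, threshold=0.8)
print(clusters)
```

In this toy case the two near-identical strains share five of six reads and fall into one cluster, which is the situation where reporting a single representative reference would avoid redundant strain calls.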

6
MAGI: Mechanistic Consequences of Genetic Variants via Genomic Foundation Models

Ofer, D.; Zok, S.; Linial, M.

2026-06-03 genetics 10.64898/2026.05.31.729117 medRxiv
Top 0.1%
14.3%

Clinical variant interpretation requires mechanism-aware evidence to guide diagnosis and clarify the biological consequences of mutations. However, existing computational predictors and genomic foundation models largely function as black boxes, providing pathogenicity labels with limited mechanistic insight or clinical actionability. Here, we present MAGI (Mechanistic Annotation of Genomic Impacts), a novel method that bridges this interpretability gap by unifying clinically relevant variant interpretation with mechanistic genomic analysis. The MAGI pipeline leverages a genomic transformer model to quantify the effects of DNA variants across 3,623 functional tracks, encompassing regulatory features, multi-omics datasets, including tissue specificity and chromatin states, and 21 additional molecular annotations of genes and transcripts. These signals are integrated through a deterministic logic layer that maps single-nucleotide variants and indels to explicit molecular consequences. We benchmark MAGI-derived consequences against clinical rationales curated from ClinVar and observe strong concordance that scales with the magnitude of functional disruption. MAGI accurately recapitulates canonical pathogenic mechanisms, including start codon loss, splice site disruption, and regulatory element perturbation, consistent with ClinVar annotations. We further present case studies addressing conflicting or incomplete mechanistic interpretations, as well as variants requiring complex inference. Notably, MAGI is also applicable to non-human genomes and was evaluated on multispecies OMIA pathogenic variants. Collectively, MAGI establishes a generalizable framework that extends beyond clinical diagnostics to enable mechanistic discovery in functional genomics, generating mechanistically grounded, testable hypotheses for variants of uncertain significance (VUS) and variants with discordant clinical interpretations. In several cases, MAGI proposes alternative explanations that challenge existing annotations, providing transparent rationales and experimentally tractable predictions.

7
Synthetic RNA-seq cohorts for data sharing: a discovery-aware benchmark at transcriptome scale

Nanda, A.; Saha, S.

2026-05-26 bioinformatics 10.64898/2026.05.22.726357 medRxiv
Top 0.1%
14.1%

Background: Sharing patient-level gene expression data is essential for translational discovery but carries documented re-identification risks. Bulk RNA-seq count matrices can retain genotypic signals, and paired clinical metadata compounds this risk through quasi-identifier matching. Synthetic RNA-seq cohorts offer a complementary path for privacy-preserving data sharing, but the field lacks a multi-axis benchmark that probes biological fidelity and empirical privacy risk at transcriptome scale. Here we present a multi-axis benchmark framework that reflects how transcriptomic cohorts are used in translational practice. Methods: We benchmarked three generative models across four cohorts drawn from datasets spanning oncology (TCGA-LUAD), sepsis (GSE184900), and pediatric IBD (RISK/GSE57945): dbTwin (a non-deep-learning, target-conditioned method that operates natively at RNA-seq scale), class-MVN (a low-rank target-conditioned multivariate Gaussian model), and PCA-CTGAN (a tabular GAN trained in PCA-compressed space). Synthetic cohorts were generated from training folds of a five-fold stratified design. We evaluated DE gene recovery, log2FC and significance (padj) concordance, held-out AUC (TSTR), SHAP concordance, and distance-based memorization risk. Results: class-MVN recovered 64.8% and 43.1% of real DE genes in the two binary cohorts with high fold-change correlation but lower significance concordance (r = 0.24-0.68) and inflated DE gene counts. dbTwin recovered 78.7% and 91.8% of real DE genes in the same cohorts, with high fold-change correlation and stronger significance concordance (r ≥ 0.88). Both methods matched held-out real AUC under TSTR, but SHAP agreement differed substantially: dbTwin preserved feature attribution patterns across cohorts (SHAP top-50 genes r = 0.84-0.99 across two binary and two multiclass cohorts), whereas class-MVN showed moderate performance for majority classes but degraded in multiclass and imbalanced settings (SHAP r = 0.31-0.79). PCA-CTGAN performed poorly across most DE and ML metrics. Distance-to-closest-record analysis did not indicate memorization by any of the models. Conclusions: We introduced a multi-axis, transcriptome-scale, discovery-aware benchmark to validate synthetic RNA-seq cohorts for translational workflows and evaluated three generative models across four real-world cohorts. These results support the use of synthetic RNA-seq cohorts for exploratory analysis and method development, while emphasizing the need for careful validation before use in higher-stakes applications. All benchmark code and data are available at https://github.com/Nanda-Aditya/rna-syn-bench.
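The distance-to-closest-record check mentioned in the abstract above can be pictured with a small sketch: for every synthetic sample, find the distance to its nearest training sample, and treat values at or near zero as a sign that a real record may have been copied. The abstract does not say which distance or feature space the benchmark uses; Euclidean distance on toy two-dimensional vectors, and the function name, are assumptions made here for illustration.

```python
import math


def distance_to_closest_record(synthetic, training):
    """For each synthetic sample, the Euclidean distance to its nearest training sample.

    Assumption: plain Euclidean distance on raw feature vectors; a real
    benchmark may work in a normalised or reduced space instead.
    """
    return [min(math.dist(s, t) for t in training) for s in synthetic]


# Toy expression vectors (two "genes" per sample).
training = [(0.0, 0.0), (3.0, 4.0)]
synthetic = [(0.0, 0.0), (3.0, 0.0)]

dcr = distance_to_closest_record(synthetic, training)
print(dcr)
```

Here the first synthetic sample is an exact copy of a training record and gets a distance of zero, while the second sits three units from its nearest training neighbour.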

8
TAMIPAMI: Software and methods for PAM/TAM identification for CRISPR and OMEGA gene editing systems

Orosco, C.; Jain, P. K.; Rivers, A. R.

2026-05-16 bioinformatics 10.64898/2026.05.15.725432 medRxiv
Top 0.1%
14.1%

Protospacer adjacent motifs (PAMs) and target-adjacent motifs (TAMs) are essential for target recognition by CRISPR-Cas and TnpB nucleases. Here we present TAMIPAMI, an efficient experimental and computational framework for rapid PAM/TAM identification. TAMIPAMI requires only a single control library and Cas or TnpB-treated library, simplifying experimental design, reducing cost, and providing greater accessibility for users. The platform interprets sequencing data with interactive visualizations and introduces a novel algorithm that determines the minimal exact set of degenerate IUPAC sequences describing the observed PAM/TAM patterns. Using this approach, we accurately recovered canonical motifs for several nucleases, including SpCas9, LbCas12a, AsCas12a, BrCas12b, Cas12i1, and AmaTnpB. TAMIPAMI is available as both a web application and command-line tool, ultimately providing an accessible and efficient platform for PAM/TAM discovery and characterization across CRISPR and OMEGA systems.
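For readers unfamiliar with degenerate IUPAC notation, the sketch below shows how a degenerate motif such as NGG stands for a set of concrete sequences. It only illustrates matching and expansion using the standard IUPAC nucleotide codes; it is not TAMIPAMI's algorithm for finding a minimal exact set of motifs, which the abstract does not describe. The function names are hypothetical.

```python
from itertools import product

# Standard IUPAC nucleotide codes mapped to the bases they allow.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def matches(motif: str, seq: str) -> bool:
    """True if the concrete sequence `seq` is described by the degenerate `motif`."""
    return len(motif) == len(seq) and all(
        base in IUPAC[code] for code, base in zip(motif, seq)
    )


def expand(motif: str) -> list:
    """All concrete sequences described by a degenerate motif."""
    return ["".join(bases) for bases in product(*(IUPAC[code] for code in motif))]


print(matches("NGG", "AGG"))    # NGG is the canonical SpCas9 PAM
print(matches("TTTV", "TTTT"))  # V excludes T
print(expand("NGG"))
```

Expanding a motif this way makes it easy to check whether a proposed set of degenerate sequences covers exactly the concrete PAM/TAM sequences observed in a screen.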

9
Evaluation of Active Learning Selection Strategies and Characterization of Informative Sequences for Sequence-to-Expression Models

Qian, J.; Rafi, A. M.; Cazottes, E.; de Boer, C.

2026-05-26 genomics 10.64898/2026.05.21.727038 medRxiv
Top 0.1%
14.0%

DNA sequence-to-expression models have advanced rapidly, yet they still generalize poorly beyond their training distribution, limiting their use for tasks such as variant effect prediction. Active learning has improved data efficiency across many machine learning domains, but no large-scale study has benchmarked selection strategies for sequence-to-expression models using real experimental data or characterized the sequences they select. We benchmarked six active learning strategies across diverse model architectures, datasets, and configurations. All strategies outperformed random sampling, with uncertainty-based methods performing best. Most of the gains achievable through many small acquisition rounds could be matched with fewer, larger rounds, making lab-in-the-loop workflows experimentally practical. Different strategies selected substantially overlapping sets of sequences that occupied distinct regions of sequence space and were enriched for higher expression, specific dinucleotide compositions, and denser transcription factor binding sites. Nevertheless, active learning consistently outperformed selection based on these biological properties alone, indicating that informativeness is not fully captured by any single feature. Together, our results establish active learning as a critical tool for improving sequence-to-expression models, identify biological signatures of informative sequences, and lay the foundation for iterative lab-in-the-loop refinement.

10
reComBat-seq: Regularized negative binomial regression for batch-effect correction in underdetermined transcriptomics datasets

Stoyanova, Z.; Malzl, D.; Menche, J.

2026-05-30 bioinformatics 10.64898/2026.05.27.728166 medRxiv
Top 0.2%
12.7%

Motivation: Batch effect correction is essential for integrating large-scale transcriptomics datasets, such as single-cell RNA-seq or multi-study bulk RNA-seq datasets, because it reduces technical noise that may mask biological signal. Existing correction methods either do not produce count data output, which is crucial for state-of-the-art downstream analyses such as differential expression analysis, or fail to converge in underdetermined study designs. Results: We present reComBat-seq, a method that extends the Negative Binomial regression framework of ComBat-seq by incorporating Elastic Net regularization. This approach resolves problems with rank-deficient design matrices while also preserving the integer nature of count data. Benchmarking on simulated and real datasets such as single-cell RNA-seq data demonstrates that reComBat-seq successfully removes batch effects in complex study designs while maintaining compatibility with downstream differential expression tools. Availability and Implementation: reComBat-seq source code can be found at https://github.com/menchelab/reComBat-seq. All code to reproduce the presented analyses can be found at https://github.com/menchelab/reComBatseq_Studies. Data produced in this study is available at https://doi.org/10.5281/zenodo.19736515. The single-cell RNA-seq data used can be found at https://doi.org/10.5281/zenodo.14234956. Supplementary Information: Proofs and volcano plots of differential expression analysis.

11
MINA: linear probes reveal coding-sequence family signal in frozen DNA encoders

Wijaya, A. S.; Leung, H.; Yoo, H.

2026-05-28 bioinformatics 10.64898/2026.05.25.727711 medRxiv
Top 0.2%
12.4%

Motivation: Frozen DNA encoders are often used as genomic feature extractors, but downstream fine-tuning does not show what information is already linearly accessible in their unchanged embeddings. We introduce MINA (Model Interrogation of Nucleotide Architectures), a lightweight probing benchmark for testing whether frozen DNA embeddings can recover (i) a 5-way protein-family label for each gene and (ii) the 1,536-dimensional GenePT embedding of each gene's natural-language summary. We compare recoverability between canonical coding sequence and TSS-centred genomic contexts. Results: In 3,244 human protein-coding genes from five families, frozen encoders recovered the family-annotation target most clearly from coding sequence. NT-v2 with meanD pooling reached macro-F1 0.828 / κ 0.821, compared with 0.672 / κ 0.702 for a CDS 4-mer baseline. Alignment to GenePT natural-language descriptions was weaker. Replacing CDS with 196,608 bp TSS-centred windows substantially reduced performance across all four encoders, indicating that the recoverable signal is primarily coding-sequence family signal rather than generic gene-function signal from arbitrary genomic context. Availability and implementation: Source code: https://github.com/Austin-Senna/dna_to_text; Python ≥3.11. Contact: asw2215@columbia.edu. Supplementary information: Supplementary tables, figures, and reproducibility details are included at the end of this preprint.

12
DanioDecima: A DNA sequence-to-function model of zebrafish embryogenesis

Voges, M. J.; Kim, Y. J.; Frank, M.; Iovino, B.; Senbabaoglu, Y.; Royer, L. A.

2026-05-31 genomics 10.64898/2026.05.29.728876 medRxiv
Top 0.2%
12.4%

Deep learning DNA sequence-to-function models offer the promise of gaining mechanistic insights into genome regulation; however, their performance is often limited by data scarcity in the species of interest. We present DanioDecima, a zebrafish-specific model leveraging transfer learning from human and mouse-trained models to predict tissue- and cell-type-specific gene expression during zebrafish embryogenesis. Initializing DanioDecima with pretrained human and mouse Borzoi and Decima weights raises the median pseudobulk Pearson r substantially across cell types and improves gene-level correlations of test set genes. An in silico directed-evolution loop guided by DanioDecima scoring generated synthetic promoters whose motif architectures cluster by the expected target lineage. These findings exemplify a cross-species transfer learning methodology for sequence-to-function models, and position DanioDecima as a practical resource for zebrafish regulatory engineering.

13
PerturbPlan: An analytical framework for designing Perturb-seq experiments

Niu, Z.; He, Y.; Galante, J.; Gschwind, A. R.; Ray, J.; Steinmetz, L. M.; Engreitz, J. M.; Katsevich, E.

2026-05-23 genomics 10.64898/2026.05.22.727199 medRxiv
Top 0.2%
12.4%

CRISPR screens with single-cell RNA-seq readouts provide a powerful tool for characterizing the functions of noncoding elements and genes. However, designing these experiments to balance statistical power and cost is challenging, given the large number of design parameters. The only available tool for this purpose is a simulation-based power calculator, but it is computationally costly and requires high-performance computing to run. We derive a novel analytical formula for the power to detect perturbation-expression associations, recapitulating power estimates from the simulation-based tool while reducing runtime by up to seven orders of magnitude. This acceleration unlocks the possibility of interactive single-cell CRISPR screen design. Accordingly, we develop PerturbPlan, an interactive web application built on the analytical power formula. PerturbPlan helps users address 11 design questions for two types of single-cell CRISPR screens, Perturb-seq and targeted Perturb-seq (TAP-seq). We apply PerturbPlan to carry out a comparative analysis of three recent Perturb-seq designs, demonstrating how optimal design varies across experiments of different scales. We also use PerturbPlan to quantify the cost savings of a recent TAP-seq study relative to a hypothetical Perturb-seq study assaying the same perturbations, illustrating how the tool can inform decisions about targeted versus whole-transcriptome readouts. In sum, PerturbPlan is the first tool to facilitate flexible and interactive design of well-powered single-cell CRISPR screen experiments.

14
POCKET-seq enables genome-wide profiling of on- and off-target transcriptional regulation events by dCas9-KRAB during CRISPR interference experiments

Joyce, C. M.; Kramer, G. D.; Vu, J. T.; Tavasoli, K. U.; Gardner, B. M.; Richardson, C. D.

2026-06-02 genetics 10.64898/2026.06.01.729390 medRxiv
Top 0.2%
12.3%

CRISPR interference screens use catalytically inactive dCas9 fused to a repressor domain to enable genetic perturbations at the transcriptomic level. Interpretation of results involves identification of guide RNAs associated with the screen phenotype, followed by secondary analysis. During validation of a genetic screen, we observed different phenotypes from non-overlapping guide RNAs targeting one gene. Here, we developed POCKET-seq to map the binding of dCas9 genome-wide. We show that off-target binding occurs frequently and can generate false-positive interactions when it occurs near the promoter of genes associated with the screen phenotype. POCKET-seq classifies these false-positive and true-positive interactions using gene ontology.

15
simPIC: flexible simulation of single-cell ATAC-seq paired-insertion counts from individuals to populations

Chugh, S.; Shim, H. S.; McCarthy, D. J.

2026-05-15 bioinformatics 10.1101/2025.09.21.676689 medRxiv
Top 0.2%
12.2%

Single-cell Assay for Transposase Accessible Chromatin (scATAC-seq) is increasingly used at population scale to study how genetic variation shapes chromatin accessibility. Method development is limited by the lack of flexible simulation tools with known ground truth. Here, we present simPIC, a fast, memory-efficient framework for simulating realistic single-cell ATAC-seq count data across individuals and populations. simPIC models cell groups, batch effects, and genotype-dependent accessibility variation, enabling controlled evaluation of population-scale methods, including chromatin accessibility quantitative trait locus (QTL) mapping. Across multiple datasets and cell types, simPIC closely matches real data distributions while scaling to cohort sizes impractical for current tools.

16
ATLAS: a scverse-compatible package for multi-omic single-cell trajectory inference integration

Leclercq, A.; Martini, L.; Bardini, R.; Savino, A.; Di Carlo, S.

2026-05-27 bioinformatics 10.64898/2026.05.23.727175 medRxiv
Top 0.2%
12.1%

Single-cell trajectory inference is widely used to study cellular differentiation and fate decisions, yet most existing approaches rely on transcriptomic information alone, limiting their ability to capture the regulatory processes underlying cell-state transitions. This work presents ATLAS (Advanced Trajectory Learning from multi-omics At Single-cell resolution), a scverse-compatible framework for trajectory inference in paired single-cell RNA-seq and ATAC-seq data. ATLAS integrates transcriptomic and chromatin accessibility information through Weighted Nearest Neighbor graphs, enabling both molecular layers to jointly inform pseudotime estimation, terminal-state identification, and fate probability inference within a unified multi-omic representation. Across synthetic and real datasets, ATLAS reconstructs coherent developmental trajectories, captures progressive fate commitment, and resolves biologically meaningful lineage structures, demonstrating the effectiveness of multi-omic integration for characterizing cellular dynamics. In addition, ATLAS enables the joint exploration of transcription factor expression and target gene activity along pseudotime, providing direct access to regulatory programs and chromatin-associated transitions that are not detectable from transcriptomic data alone. Overall, ATLAS provides a scalable and biologically informative framework for studying dynamic cellular processes in single-cell multi-omics experiments.

17
Hierarchical refinements of cis-regulatory inputs improve scalable gene expression prediction

Zhang, Q.; Xing, M.; Liao, Q.; Li, Z.; Huang, D.-S.

2026-06-02 bioinformatics 10.64898/2026.05.31.729151 medRxiv
Top 0.3%
12.0%

Deciphering the relationships between cis-regulatory elements (CREs) and target gene expression has long been a challenging problem in molecular biology. However, predicting gene expression from hundreds of candidate cis-regulatory elements (cCREs) requires models that scale to long, noisy inputs while retaining interpretable regulatory structure. Existing Transformer-based approaches typically attend over all nucleotides and all surrounding cCREs, diluting causal signals when hundreds of elements compete for limited model capacity. Here we introduce a two-stage selective framework (TSSF) that performs hierarchical refinements: nucleotide-level masking within each cCRE, followed by cCRE-level selection around each gene, implemented with information-bottleneck priors and a fully Transformer-based architecture. Across 70 human cell types and tissues, TSSF and lightweight variants improve expression prediction and enhancer-gene prioritization relative to strong baselines, including on cross-cell-line and cell-type-specific benchmarks. Prediction-stratified analysis motivates a distance-decay prior that aligns attention with long-range regulatory geometry, and chromatin-contact augmentation improves recovery of distal links. Motif analyses of high-confidence predictions recover proximal and distal regulatory programs, supporting mechanistic interpretability. TSSF offers a general strategy for scalable, interpretable modeling of high-dimensional regulatory inputs in genomics.

18
MirMachine 2: a scalable, evolutionarily informed pipeline for microRNA annotation and comparative genomics across thousands of animal genomes

Paynter, V. M.; Umu, S. U.; Tierney, J. A. S.; Tricomi, F. F.; Haggerty, L.; Fromm, B.

2026-05-21 bioinformatics 10.64898/2026.05.19.726197 medRxiv
Top 0.3%
10.3%

Genome sequencing is rapidly outpacing the annotation of conserved regulatory elements, limiting the evolutionary and comparative insights that can be extracted from expanding genome collections. MicroRNAs are among the most conserved and phylogenetically informative genes, yet automated annotation has remained difficult to scale while preserving evolutionary interpretability. Here we present MirMachine 2, an evolutionarily informed framework that combines curated reference models, lineage-aware scoring, and adaptive filtering to enable robust genome-wide microRNA annotation at scale. Applying this to thousands of animal genomes reveals that many apparent absences of conserved microRNAs reflect methodological bias rather than biological loss, particularly in underrepresented lineages. By enabling consistent and interpretable comparison of microRNA complements across large datasets, MirMachine 2 establishes scalable microRNA annotation as a practical foundation for genome-scale evolutionary and comparative genomics.

19
CMAPS: Causal Mediation Analysis of Perturbation Screens with Application to Genome-scale Perturb-seq Data

Duan, J.; Kang, H.; Keles, S.

2026-05-23 genomics 10.64898/2026.05.21.726924 medRxiv
Top 0.3%
10.3%

CRISPR-Cas9 perturbation screens coupled with single-cell multi-omic profiling enable dissection of gene regulatory mechanisms, yet existing analyses largely quantify total perturbation effects and offer limited insight into the molecular intermediates that transmit these effects. We introduce CMAPS (Causal Mediation Analysis for Perturbation Screens), a semiparametric framework for robust mediation analysis that accommodates unmeasured mediator-outcome confounding and incorporates an adaptive bootstrap test with false discovery rate control. Simulations and data-driven computational experiments show that CMAPS yields accurate, calibrated mediation estimates and robust mediator identification, as confirmed through negative controls and permutation-based validation. Applied to K562 Perturb-seq, CMAPS recapitulates transcriptional cascades downstream of GATA1. In BT16 MultiPerturb-seq data, CMAPS identifies promoter-centric, enhancer-distributed, and mixed cis-regulatory programs linking chromatin remodeling factors to transcriptional responses. CMAPS provides a rigorous and interpretable framework for mechanistic inference in single-cell perturbation screens. CMAPS is implemented in R and is available at https://github.com/keleslab/CMAPS.

20
Embryo-scale Visual Cell Sorting reveals a conserved transcriptomic signature of nucleolar size linked to proteostasis

Kim, H.-J.; Pendyala, S.; Lin, J.; Cooke, T.; Li, G.; Noble, W. S.; Deng, X.; Disteche, C. M.; Trapnell, C.; Fowler, D.

2026-05-26 genomics 10.64898/2026.05.25.727721 medRxiv
Top 0.3%
10.3%

Nucleoli, nuclear speckles and other compartments regulate transcription, RNA processing, and chromatin organization within the nucleus, yet the relationship of their morphology to developmental gene expression programs in vivo is poorly understood. Here, we develop a high-throughput Visual Cell Sorting (VCS) workflow for fixed cells and nuclei that combines antibody-based photoconversion; GPU-accelerated, real-time image analysis; and three-level single-cell combinatorial indexing RNA-seq (sci-RNA-seq3) to link nuclear compartment morphology to single-nucleus transcriptomes at embryo scale. We use VCS to analyze and sort over 1 million mouse embryo-derived nuclei by nucleolar, nuclear speckle, or nuclear size and construct a transcriptional atlas annotated with nuclear compartment phenotypes. Nuclear compartment size varies both between and within lineages and is shaped by proliferation and differentiation. In extracellular matrix protein-producing cell types such as fibroblasts, chondrocytes, and osteoblasts, nucleolar enlargement is uncoupled from the cell cycle, and erythroid cells exhibit a sharp nucleolar contraction preceding cell-cycle exit. We identify a 41-gene transcriptional signature whose expression tracks nucleolar size, enriched for ribosome biogenesis, mitochondrial metabolism, unfolded protein response, stress granule, and ubiquitin-proteasome pathway components. We used this nucleolar transcriptional signature to annotate mouse, zebrafish and human developmental atlases with nucleolar size information, revealing a conserved coupling between nucleolar activity and proteostasis programs. Our work establishes Visual Cell Sorting as a scalable platform for mapping image-based phenotypes to molecular programs; details the relationship between nuclear compartment phenotypes and development; and provides a transcriptional signature to estimate nucleolar size from existing single-cell datasets.